{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3",
      "language": "python"
    },
    "language_info": {
      "name": "python",
      "version": "3.7.6",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "pygments_lexer": "ipython3",
      "nbconvert_exporter": "python",
      "file_extension": ".py"
    },
    "colab": {
      "name": "26 - Entity extraction workflows",
      "provenance": [],
      "collapsed_sections": []
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "POWZoSJR6XzK"
      },
      "source": [
        "# Entity extraction workflows\n",
        "\n",
        "Entity extraction is the process of identifying names, locations, organizations and other entity-like tokens in unstructured text. Entity extraction can organize data into topics and/or feed downstream machine learning pipelines.\n",
        "\n",
        "This notebook will show how to use the entity extraction pipeline in txtai with workflows."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qa_PPKVX6XzN"
      },
      "source": [
        "# Install dependencies\n",
        "\n",
        "Install `txtai` and all dependencies."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5",
        "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
        "trusted": true,
        "_kg_hide-output": true,
        "id": "24q-1n5i6XzQ"
      },
      "source": [
        "%%capture\n",
        "!pip install git+https://github.com/neuml/txtai"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Extract entities\n",
        "\n",
        "Let's get right to it! The following example creates an entity pipeline and extracts entities from text. \n"
      ],
      "metadata": {
        "id": "0p3WCDniUths"
      }
    },
    {
      "cell_type": "code",
      "metadata": {
        "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a",
        "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0",
        "trusted": true,
        "id": "2j_CFGDR6Xzp",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "6276538e-5b06-4038-ae0f-0aa642e86814"
      },
      "source": [
        "from txtai.pipeline import Entity\n",
        "\n",
        "data = [\"US tops 5 million confirmed virus cases\",\n",
        "        \"Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg\",\n",
        "        \"Beijing mobilises invasion craft along coast as Taiwan tensions escalate\",\n",
        "        \"The National Park Service warns against sacrificing slower friends in a bear attack\",\n",
        "        \"Maine man wins $1M from $25 lottery ticket\",\n",
        "        \"Make huge profits without work, earn up to $100,000 a day\"]\n",
        "\n",
        "entity = Entity()\n",
        "\n",
        "for x, e in enumerate(entity(data)):\n",
        "  print(data[x])\n",
        "  print(f\"  {e}\", \"\\n\")"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "US tops 5 million confirmed virus cases\n",
            "  [('US', 'LOC', 0.999273955821991)] \n",
            "\n",
            "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg\n",
            "  [('Canada', 'LOC', 0.999609649181366), ('Manhattan', 'MISC', 0.651396632194519)] \n",
            "\n",
            "Beijing mobilises invasion craft along coast as Taiwan tensions escalate\n",
            "  [('Beijing', 'LOC', 0.9996659755706787), ('Taiwan', 'LOC', 0.9996755123138428)] \n",
            "\n",
            "The National Park Service warns against sacrificing slower friends in a bear attack\n",
            "  [('National Park Service', 'ORG', 0.9993489384651184)] \n",
            "\n",
            "Maine man wins $1M from $25 lottery ticket\n",
            "  [('Maine', 'LOC', 0.9987521171569824)] \n",
            "\n",
            "Make huge profits without work, earn up to $100,000 a day\n",
            "  [] \n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "The section above is running an entity extraction pipeline for each row in data. The outputs are the token(s) identified as part of an entity, the type of entity and score or confidence in the prediction."
      ],
      "metadata": {
        "id": "bVd_qHh97IlJ"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Feed entities to a workflow\n",
        "\n",
        "The next section demonstrates how the entity extraction pipeline can be used as part of a workflow. This workflow uses the output entities and builds an embeddings index for each row. This effectively computes entity embeddings to compare the row similarity with a focus on mentioned entities."
      ],
      "metadata": {
        "id": "fY71Dyyt8Arv"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from txtai.embeddings import Embeddings, Documents\n",
        "from txtai.workflow import Workflow, Task\n",
        "\n",
        "# Create workflow with an entity pipeline output into a documents collection\n",
        "documents = Documents()\n",
        "workflow = Workflow([Task(lambda x: entity(x, flatten=True, join=True)), Task(documents.add, unpack=False)])\n",
        "\n",
        "# Run workflow\n",
        "for _ in workflow([(x, row, None) for x, row in enumerate(data)]):\n",
        "  pass\n",
        "\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/nli-mpnet-base-v2\"})\n",
        "embeddings.index(documents)\n",
        "\n",
        "for query in [\"North America\", \"Asia Pacific\"]:\n",
        "  index = embeddings.search(query, 1)[0][0]\n",
        "  print(query, \"\\t\", data[index])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "yB3e6HVM8dRJ",
        "outputId": "8424ce5d-9652-421d-d0bd-261b911ea7d6"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "North America \t Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg\n",
            "Asia Pacific \t Beijing mobilises invasion craft along coast as Taiwan tensions escalate\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Run workflow YAML\n",
        "\n",
        "Below is the same example using workflow YAML."
      ],
      "metadata": {
        "id": "u_0WiV1NvbCK"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "workflow = \"\"\"\n",
        "writable: true\n",
        "embeddings:\n",
        "  path: sentence-transformers/nli-mpnet-base-v2\n",
        "\n",
        "entity:\n",
        "\n",
        "workflow:\n",
        "  index:\n",
        "    tasks:\n",
        "      - action: entity\n",
        "        args: [null, \"simple\", true, true]\n",
        "      - action: index\n",
        "\"\"\"\n",
        "\n",
        "from txtai.app import Application\n",
        "\n",
        "# Create and run workflow\n",
        "app = Application(workflow)\n",
        "for _ in app.workflow(\"index\", [(x, row, None) for x, row in enumerate(data)]):\n",
        "  pass\n",
        "\n",
        "# Run queries\n",
        "for query in [\"North America\", \"Asia Pacific\"]:\n",
        "  index = app.search(query)[0][\"id\"]\n",
        "  print(query, \"\\t\", data[index])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "3qmRdU7LvhQ3",
        "outputId": "e01453fc-5bbf-41a6-e557-64fa453f80ba"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "North America \t Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg\n",
            "Asia Pacific \t Beijing mobilises invasion craft along coast as Taiwan tensions escalate\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Wrapping up\n",
        "\n",
        "This notebook introduced entity extraction pipelines with txtai. This pipeline supports a number of different configurations to help feed downstream systems and/or directly use the entities.\n",
        "\n",
        "As with other pipelines, the entity extraction pipeline can be used standalone in Python, as an API service or as part of a workflow!"
      ],
      "metadata": {
        "id": "i3kkN5Vpx8ks"
      }
    }
  ]
}
